
Poster: xtifr Date: Mar 19, 2011 4:47pm
Forum: etree Subject: Re: a million songs yet?

Ok, that sounds like a good approach. It'll still require about 90k file downloads plus all the accesses required to figure out where the files are, but at least they'll be small files. I'll start with some small-scale experimentation for proof of concept, and set it up so that I can throttle it back if necessary when I do a full run.


Poster: brewster Date: Mar 20, 2011 11:35am
Forum: etree Subject: Script to download files from lots of items on the Archive

Hank Bromley and the Collections group have gotten good at using wget to download files from lots of items on the Archive. I have included our internal documentation file on how to do it. Wish it were a web form, but we just have never gotten to it (any volunteers?).

This should be helpful to figure out how to download all the files.xml files, for instance.

I have also included a list of all public etree items, including the Grateful Dead concerts. It comes to over 97,000!




Poster: xtifr Date: Mar 20, 2011 11:53am
Forum: etree Subject: Re: Script to download files from lots of items on the Archive

Thanks. I'm already pretty solid with wget (and curl), and in any case, for this task, I plan to use Python's built-in urllib instead of spawning thousands of subprocesses. The list of identifiers will be very handy, though.

(I do know how to get identifiers from the advanced search API too, but since someone's already done the work, I have no problems taking advantage of it.) :)
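For readers following along, here is a minimal sketch of what the urllib approach might look like for a single item. It is not the script used in this thread. The URL pattern, the set of formats treated as "songs", and the function names are all assumptions made for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: each item's file listing is served at this URL pattern.
FILES_XML_URL = "https://archive.org/download/{identifier}/{identifier}_files.xml"

# Assumption: which <format> values count as one "song" (a hypothetical choice).
AUDIO_FORMATS = {"Flac", "Shorten"}


def files_xml_url(identifier):
    """Build the (assumed) URL of an item's files.xml listing."""
    return FILES_XML_URL.format(identifier=identifier)


def fetch_files_xml(identifier, timeout=30):
    """Download one item's files.xml in-process and return its raw bytes."""
    with urllib.request.urlopen(files_xml_url(identifier), timeout=timeout) as resp:
        return resp.read()


def count_songs(files_xml):
    """Count <file> entries whose <format> text is in AUDIO_FORMATS."""
    root = ET.fromstring(files_xml)
    return sum(
        1
        for f in root.iter("file")
        if (f.findtext("format") or "").strip() in AUDIO_FORMATS
    )
```

Fetching in-process like this avoids starting one wget subprocess per item; `count_songs(fetch_files_xml(identifier))` would then give a per-show count under the assumptions above.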


Poster: xtifr Date: Mar 23, 2011 1:17pm
Forum: etree Subject: Counting has started!

I've started running the script. Early results suggest that LMA shows average about 15 songs each, which implies that the total will be well over a million!
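The extrapolation behind that estimate can be sketched as follows. The function name is hypothetical, and the inputs are just the round figures mentioned in this thread (about 15 songs per show, about 97,000 shows), not measured results.

```python
def estimate_total_songs(sample_counts, total_shows):
    """Extrapolate: mean songs per sampled show times the number of shows."""
    if not sample_counts:
        raise ValueError("need at least one sampled show")
    mean = sum(sample_counts) / len(sample_counts)
    return round(mean * total_shows)


# Round figures from the thread: ~15 songs per show, ~97,000 shows.
print(estimate_total_songs([15], 97000))  # 1455000
```

The real total depends on how representative the early sample is, so this is only a rough projection.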

It's going to take a while to get the final results because A) I've got it throttled to pause for a second between shows (97,000 shows / 3,600 seconds per hour = about 27 hours of idleness alone) and B) my Internet connection gets flaky in the rain. But I'll post the grand total as soon as it's available.
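A throttled loop of this kind might look like the sketch below. It is an illustration, not the actual script: the function names, the retry count, and the back-off rule are assumptions, and `process` stands in for whatever per-show work is done.

```python
import time


def idle_hours(n_items, pause_seconds=1.0):
    """Hours spent purely sleeping when pausing once per item."""
    return n_items * pause_seconds / 3600.0


def crawl(identifiers, process, pause_seconds=1.0, retries=3, sleep=time.sleep):
    """Call process(identifier) for each item, pausing between items.

    OSError (which covers urllib's URLError and socket timeouts) is retried
    a few times so that a flaky connection does not abort the whole run.
    """
    results = {}
    failed = []
    for identifier in identifiers:
        for attempt in range(1, retries + 1):
            try:
                results[identifier] = process(identifier)
                break
            except OSError:
                if attempt == retries:
                    failed.append(identifier)  # give up; revisit later
                else:
                    sleep(pause_seconds * attempt)  # simple linear back-off
        sleep(pause_seconds)  # the throttle itself
    return results, failed
```

With a one-second pause, `idle_hours(97000)` comes to a little under 27 hours. Raising `pause_seconds` is the knob for throttling back further if the load becomes a problem.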

I don't expect it to, but if my hammering becomes a problem, please let me know at my gmail account, "xtifr.w". Thanks. (I do need to update my archive account one of these days.)


Poster: Tyler Date: Mar 25, 2011 7:14pm
Forum: etree Subject: Re: Counting has started!

This is awesome news! I can't wait to see what it ends up being. Crawl it away!